refactor/metadata package by d-v-b · Pull Request #3919 · zarr-developers/zarr-python

d-v-b · 2026-04-21T22:26:27Z

This PR creates a new subpackage called zarr-metadata just for JSON metadata. it's stored in packages/zarr_metadata. It contains typeddict classes that model the spec-defined JSON forms for v2 array + group metadata, and v3 array + group metadata, including data types, codecs, chunk key encodings and chunk grids. I only included type definitions for metadata that has an external spec. So zarr-python will need to define some types internally for e.g. unspecified data types or codecs.

I would like to publish this subpackage to pypi. These types useful to any python tool that works with zarr data, even if that tool doesn't use zarr-python. It is also useful to zarr-python, because it means we can remove and resolve some lingering questions about publishing types.

If we adopted the changes here, adding a new data type / codec / chunk grid, etc, would require adding types to zarr-metadata, then adding the implementation in zarr-python that work with those types. We wouldn't need to do these 2 operations in the same PR, but I expect that would be the normal practice.

This change does add complexity to our publishing workflow: we need to ensure that zarr-metadata changes are published at or before new zarr-python releases. We should add some checks to ensure that this happens.

Docs for the new package are missing from this PR. I would handle that in a follow-up.

I would appreciate feedback at all levels, including the following topics:

are we aligned on the idea of a subpackage?
are the types correct (do they match the spec documents)
what needs to change in our CI?

closes #3355 and #3795

Adds packages/zarr-metadata, a sibling PyPI package that contains spec-defined Zarr v2 and v3 metadata types as pure-typing artifacts (TypedDicts, type aliases, Final string constants, NewType validators). No runtime logic beyond minimal regex validators for spec-format-locked strings (hex floats, base64 bytes, raw-bytes name). Layout (anything imported from `zarr_metadata.v3.X` is a v3-spec artifact; from `zarr_metadata.v2.X` is v2-spec; only true cross-version primitives sit at the top level): zarr_metadata/ ├── common.py # JSON, NamedConfig, NamedRequiredConfig ├── __init__.py # ArrayMetadata, GroupMetadata unions ├── v2/ │ ├── array.py, group.py, consolidated.py, codec.py └── v3/ ├── array.py, group.py, consolidated.py ├── chunk_grid/ {regular, rectilinear} ├── chunk_key_encoding/ {default, v2} ├── codec/ {blosc, bytes, crc32c, gzip, │ sharding, transpose, zstd} └── data_type/ {bool, int8/16/32/64, uint8/16/32/64, float16/32/64, complex64/128, bytes, string, numpy_datetime64, numpy_timedelta64, struct, raw} zarr-python source is unchanged in this branch. zarr-metadata is shipped as an independent package; a follow-up PR will adopt it inside zarr-python once the package is published to PyPI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…factor/metadata-package

The structural tests constructed dict literals annotated with TypedDict types and asserted that a key we just inserted came back out -- which exercises Python's dict, not the type. The TypedDict has no runtime shape check, so even a type-incompatible dict would pass. Pyright (in CI) is what actually verifies the shapes. Constant-equality tests (e.g. `assert NAME == "name"`) also tested nothing. Kept the validating-constructor tests for hex_float{16,32,64}, base64_bytes, and raw_bytes_dtype_name -- those exercise real regex logic and catch real bugs. Replaced test_imports.py with a single smoke test that confirms the package loads and the top-level union types are reachable. Renamed test_structural.py -> test_validators.py to match what it actually tests. Test count: 47 -> 6, ~750 lines -> 89. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-22T13:30:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.11%. Comparing base (029c376) to head (331ea93).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3919   +/-   ##
=======================================
  Coverage   93.11%   93.11%           
=======================================
  Files          85       85           
  Lines       11365    11365           
=======================================
  Hits        10582    10582           
  Misses        783      783

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

d-v-b · 2026-04-22T13:38:01Z

if this gets merged, we would need the following changes in subsequent PRs:

setting up CI / CD for zarr-metadata
publishing zarr-metadata to pypi
docs
changes to zarr-python:
- depending on zarr-metadata
- removing internally-defined types
- deprecating re-exported public types
- integrating zarr-python's docs with zarr-metadata's docs

zarr-metadata is a typing-only foundational package consumed by libraries; supporting one Python version below the current minimum widens the audience at minimal cost. The only blocker was PEP 695 generic class syntax in `common.py`: class NamedConfig[TName: str, TConfig: ...](TypedDict): Rewritten to the PEP 484 `Generic[T]` form, which works on 3.11+. The two affected classes carry `# noqa: UP046` comments since the ruff rule pushes toward the newer syntax that we deliberately avoid. All other modern features (PEP 604 `X | Y` unions at runtime, `NotRequired`, `extra_items=` via typing_extensions, `ReadOnly`) already work on 3.11. Verified: package imports and all 6 tests pass on Python 3.11. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

d-v-b · 2026-04-27T21:18:55Z

@zarr-developers/python-core-devs I'd love feedback on this.

maxrjones · 2026-04-27T21:37:12Z

@zarr-developers/python-core-devs I'd love feedback on this.

I really like the idea of having a public interface for metadata classes with minimal dependencies, and am fine with a sub-package as the mechanism for doing that.

The order of operations seems a little off to me. In particular, I'm concerned about drift over the course of the follow-up PRs, and us realizing that we're not happy with the design of zarr-metadata while migrating the core library. The following sequence would make more sense to me:

Open a PR with a design document for public metadata classes
Refactor the internal metadata classes to agree with the agreed upon API
Move internal metadata classes to a subpackage
Release subpackage
Depend on subpackage internally

As an aside, I'd recommend reserving the package names that you're hoping to use on PyPI in anticipation of these changes

d-v-b · 2026-04-27T22:13:01Z

Open a PR with a design document for public metadata classes
Refactor the internal metadata classes to agree with the agreed upon API

The purpose of this PR is not to define metadata classes that zarr-python uses. It is narrowly scoped to define typeddicts that model the contents of zarr metadata documents. So this package doesn't affect anything about the shape of metadata-modelling classes in zarr-python, except by defining the type of their JSON form.

This package would only replace typeddict / literal types that we currently define inside zarr-python. Allowing this package to own those types requires no behavioral changes from zarr-python.

That's why my timeline is roughly "make this package stand alone (exist on pypi), then make zarr-python depend on it"

d-v-b · 2026-04-27T22:31:17Z

n particular, I'm concerned about drift over the course of the follow-up PRs, and us realizing that we're not happy with the design of zarr-metadata while migrating the core library.

What aspects of the code added here do you think might drift? The only changes I anticipate would be fixes for the following issues:

failure to match the spec
failure to use python types ergonomically
name changes / bikeshedding

All of these can be handled gracefully by downstream packages.

As an aside, I'd recommend reserving the package names that you're hoping to use on PyPI in anticipation of these changes

good idea, this is done under my own pypi account. if this PR is approved I would transfer ownership to zarr-developers.

maxrjones · 2026-04-27T23:20:07Z

The purpose of this PR is not to define metadata classes that zarr-python uses.

It seemed from your earlier request for feedback that this subpackage was intended to eventually be useful for zarr-python.

What aspects of the code added here do you think might drift?

Most visibly, I'd like some more thought to be put into the names of the classes. It's inconsistent that some include Metadata and others don't.

d-v-b · 2026-04-27T23:46:11Z

It seemed from your earlier request for feedback that this subpackage was intended to eventually be useful for zarr-python.

The types and typeddicts in this package are useful for zarr python. These types define the JSON structure of the array metadata, including all the chunk grids, codecs, data types, etc. We use this information to model the JSON metadata documents we read and write.

So the impact of this package would basically be upstreaming many typeddict and literal types that zarr-python defines, e.g. this one.

Most visibly, I'd like some more thought to be put into the names of the classes. It's inconsistent that some include Metadata and others don't.

The array metadata, group metadata, and consolidated metadata typddicts are named XMetadata. That's tracking how we talk about these data structures. I'd be fine with using the metadata suffix for other types, e.g. BytesCodec -> BytesCodecMetadata.

maxrjones

I like the direction here. I think we should have a very high bar for accuracy if we're promoting these as the canonical representation for public use.

For naming (recognizing this is bike-shedding), I'd prefer consistent suffix conventions:

Codecs all end in Codec
DataTypes all end in DataType
Chunk grids all end in ChunkGrid

I think this functionality would be best tested with some round-trip property tests

maxrjones · 2026-04-28T00:23:45Z

+    See https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html#data-type-encoding
+    """
+
+    fieldname: str
+    datatype: str
+    shape: NotRequired[tuple[int, ...]]


wouldn't the spec-faithful representation just have a list of lists? the V2 spec doesn't have JSON keys for fieldname or datatype

maxrjones · 2026-04-28T00:26:13Z

+    data_type: MetadataField
+    shape: tuple[int, ...]
+    chunk_grid: MetadataField
+    chunk_key_encoding: MetadataField
+    fill_value: JSON
+    codecs: tuple[MetadataField, ...]
+    attributes: NotRequired[Mapping[str, JSON]]
+    storage_transformers: NotRequired[tuple[MetadataField, ...]]


there are more restrictions on data_type, chunk_grid, chunk_key_encoding, and codecs than implied by MetadataField. This would be more useful if it were more strictly typed. In addition, MetadataField allowing a string shorthand is wrong in the chunk grid case.

maxrjones · 2026-04-28T00:27:06Z

+    """
+
+    name: Crc32cCodecName
+    configuration: NotRequired[dict[str, JSON]]


this typing allows erroneous metadata (e.g., "configuration": {"foo": 1}})

maxrjones · 2026-04-28T00:27:43Z

+    clevel: int
+    shuffle: BloscShuffle
+    blocksize: int
+    typesize: int


typesize should be optional because it is not required with noshuffle

maxrjones · 2026-04-28T00:31:37Z

+
+Codec = str | Mapping[str, object]
+"""
+The widest JSON shape that can specify a codec (v2 or v3).


why have a v2 or v3 codec under the v3 namespace?

maxrjones · 2026-04-28T00:36:09Z

+from zarr_metadata.common import JSON, NamedConfig
+
+
+class AllowedExtraField(TypedDict, extra_items=JSON):  # type: ignore[call-arg]


I'd slightly prefer OptionalExtension as the name here

maxrjones · 2026-04-28T00:39:46Z

+    "INT32_DTYPE_NAME",
+    "Int32DTypeName",
+    "Int32FillValue",


do we really need three exports for each data type/codec? Could we start more minimal and add more if they are needed?

maxrjones · 2026-04-28T00:40:19Z

+DateTimeUnit = Literal[
+    "Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as", "generic"
+]
+"""Time unit codes used by numpy.datetime64."""


this should be declared once and shared between this and timedelta

maxrjones · 2026-04-28T00:41:26Z

+class NamedRequiredConfig(TypedDict, Generic[TName, TConfig]):  # noqa: UP046
+    """
+    Named-config envelope with required configuration.
+
+    Generic with two parameters: name literal and configuration mapping.
+
+    Uses the PEP 484 ``Generic[T]`` form rather than PEP 695 ``[T]`` syntax
+    so the package supports Python 3.11.
+    """
+
+    name: ReadOnly[TName]
+    configuration: ReadOnly[TConfig]


why include this?

github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 21, 2026

d-v-b force-pushed the refactor/metadata-package branch from ef4b773 to fbc24f3 Compare April 22, 2026 11:50

d-v-b and others added 2 commits April 22, 2026 13:59

Merge branch 'main' of github.com:zarr-developers/zarr-python into re…

0ae8db9

…factor/metadata-package

d-v-b force-pushed the refactor/metadata-package branch from e6303f5 to 0ae8db9 Compare April 22, 2026 12:00

d-v-b and others added 2 commits April 22, 2026 15:22

Merge branch 'main' into refactor/metadata-package

1b62c4c

d-v-b requested a review from a team April 22, 2026 13:41

d-v-b added the enhancement New features or improvements label Apr 23, 2026

Merge branch 'main' into refactor/metadata-package

331ea93

maxrjones reviewed Apr 28, 2026

View reviewed changes

		from zarr_metadata.common import JSON, NamedConfig


		class AllowedExtraField(TypedDict, extra_items=JSON): # type: ignore[call-arg]

Uh oh!

Conversation

d-v-b commented Apr 21, 2026

Uh oh!

codecov Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

d-v-b commented Apr 22, 2026

Uh oh!

d-v-b commented Apr 27, 2026

Uh oh!

maxrjones commented Apr 27, 2026

Uh oh!

d-v-b commented Apr 27, 2026

Uh oh!

d-v-b commented Apr 27, 2026

Uh oh!

maxrjones commented Apr 27, 2026

Uh oh!

d-v-b commented Apr 27, 2026

Uh oh!

maxrjones left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Apr 22, 2026 •

edited

Loading